Rewrite dask-jobqueue with SpecCluster#307
Conversation
This is the first step towards rewriting with SpecCluster I mostly copied the implementations from the Cluster classes, but then removed the cluster bits
Fix flake8 as well.
|
For some reason, I could not get the PBS test to pass locally (basically the worker never shows up and I was unable to debug this). I managed to get a very similar test passing with SGE though and I pushed into your PR branch, hope you don't mind. |
|
Thanks @lesteve ! I'm very glad to see this. Both because it's always nice to have help (you're very welcome to push to this branch or any other I have) and because it's nice to see that this approach is accessible. |
|
I'll make PBSCluster/SGECluster functions next to get your feedback on how they feel. |
|
OK, I've added a rough draft using SpecCluster to build out a generic JobQueueCluster implementation. It should compose well with the Job classes. It's currently tricky to pass around the keyword arguments correctly, but it's certainly less work than before. If anyone has an opportunity to try this out on a real cluster and provide feedback, that would be welcome. |
|
OK, this use of job-name seems to work, at least for PBS. I'm not sure what's up with SGE. This change is because we are now using the |
|
I have looked at this quickly: at the moment in On my cluster if I try something like this: from dask_jobqueue import SGEJob
from dask_jobqueue.job import JobQueueCluster
from dask.distributed import Client
env_extra = ['source /sequoia/data1/lesteve/miniconda3/etc/profile.d/conda.sh',
'conda activate dask-dev',
]
resource_spec = 'h_vmem=10G,mem_req=500M'
queue = 'all.q'
cluster = JobQueueCluster(1, Job=SGEJob, queue=queue, env_extra=env_extra,
cores=1, processes=1,
memory='16GB',
resource_spec=resource_spec,
interface='ib0'
)I see a job with Log from the worker: It looks like None is passed as the scheduler address ( |
|
SpecCluster starts the scheduler and then passes that address to each of the workers as they start as the first argument. I think the problem is that we're not passing through |
|
Tests are now passing. Thanks for pointing out that we weren't getting the scheduler address @lesteve , that was the issue. |
|
Some next steps:
|
|
OK, I think that I've addressed all comments. @lesteve I also removed the xfail on the adaptive test (things work upstream now). Tests pass. From my perspective this is good to go. |
|
I think this looks good, in particular the code to convert from I am going to merge this PR. I feel this has already been under review long enough and the only contentious point left was the It would be great if people involved in this PR could test Thanks a lot to everyone involved in this PR and in particular @mrocklin for doing the bulk of the work! Another thing to tidy up in another PR is to stop using |
|
Thanks all! Glad to see this brought to completion. |
|
Woot! |
|
🚀 |
|
🎉 |
|
Next question: what should happen before we release? It would be nice if people could try out I think that I've covered all of the concerns that @lesteve had (except for the |
|
I had this mostly in mind before the release:
It would have been great since we are introducing breaking changes to potentially think about introducing a few more, e.g. #323 and #205 but I don't think that is critical. |
This is the first step towards rewriting with SpecCluster
I mostly copied the implementations from the Cluster classes,
but then removed the cluster bits
For people reviewing, I recommend looking at the tests first.